The secure hash function has become the default choice for information
security, especially in applications that require data storage or manipulation.
Consequently, optimized implementations of these functions in terms of
Throughput or Area are in high demand. In this work we propose a new
design of the secure hash algorithm 3 (SHA-3), which aims to increase the
performance of this function by using pipelining. Four pipelined designs are
proposed, with two, three, four, and six pipeline stages. This approach allows us
to design data paths of SHA-3 with higher Throughput and higher clock
frequencies. The design reaches a maximum Throughput of 102.98 Gbps on
Virtex 5 and 115.124 Gbps on Virtex 6 in the case of the six stages, for 512 bits
output length. However, resource utilization increases with the number of
cores used in each case. The proposed designs are coded in very high-speed
integrated circuit (VHSIC) hardware description language (VHDL), implemented on
Xilinx Virtex-5 and Virtex-6 field-programmable gate array (FPGA) devices, and
compared to existing FPGA implementations.

Keywords: High throughput, Keccak, Pipelining, SHA-3
This is an open access article under the CC BY-SA license. 
1. INTRODUCTION

Due to the attacks on the message-digest algorithm 5 (MD5) and secure hash algorithm 1 (SHA-1) hash
algorithms [1], [2], the National Institute of Standards and Technology (NIST) organized a public
competition to develop a new hash standard, in which 64 candidate algorithms were submitted to NIST for
consideration. Among these, 51 met the minimum acceptance criteria, marking the beginning of the first round
of the SHA-3 competition. Based on various evaluation criteria, only 14 candidates were chosen for the
second round of the competition. Keccak was ultimately selected as the SHA-3 standard [3].
Keccak is based on a sponge construction, which differs from the Merkle-Damgård construction used by MD5,
SHA-1, and SHA-2 [4]-[6]. With this structure, Keccak is more secure than the existing standards [7]. During
and after the NIST competition, several implementations of SHA-3 were presented [8]-[17], [18]. In spite of
the availability of software implementations of security algorithms on powerful central processing units
(CPUs) [19], hardware implementations of these algorithms on advanced platforms such as
field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) remain in great
demand.

Optimized hardware implementations of these algorithms in terms of area or throughput are in high
demand, depending on whether the target application is slow or fast. Consequently, these implementations can
be optimized either for low area (lightweight implementations) or for higher throughput (high-performance
implementations). The first category is optimized for compactness by merging the Rho, Pi, and
Chi steps of the algorithm into a single step [12]-[16]. In the same line of thought, Jungk and Stöttinger [20]
decreased the area cost by using the shift register primitives of the FPGA instead of distributed random
access memory (RAM). These algorithms can also be optimized for high Throughput by applying
optimization techniques such as loop unrolling [16] and pipelining [11]-[20], [21].

In the case of the SHA-3 hash algorithm, pipelining can be employed in two different manners. The
first is internal pipelining, or sub-pipelining, where the registers are introduced inside the hash
compression function. Athanasiou et al. [11] introduce a pipeline stage between the π and the χ sub-rounds of
the algorithm, while Mestiri et al. [14] inserted two registers between the sub-functions: the first register is
inserted between Pi and Chi, and the second register is placed at the end of the Keccak round. The
principal disadvantage of this type of pipelining is that the transformations of the next block cannot
start before the output of the previous block becomes available. The second means of pipelining is
external pipelining, which is done by placing pipeline registers between two successive rounds, where each
stage contains the entire compression function [15].

Presently, cryptographic hash functions play a critical role in many recent fields such as
radio-frequency identification (RFID) [22], the internet of things (IoT) [23]-[26], medicine [27], wireless
sensor networks (WSNs) [28], [29], and cloud computing [30], [31]. Consequently, enhancing the speed and
performance of SHA-3 on hardware platforms is in great demand. For example, in the IoT, where billions of
objects need to be connected, data integrity is an indispensable factor. Hardware implementations of
cryptographic hash functions are greatly demanded in order to secure the communication between the
interconnected objects constituting IoT applications, mainly in a public network.

In this study, our main focus is increasing the performance of SHA-3 by applying pipelining.
We propose four different pipelined designs: two, three, four, and six-stage pipelines. These designs are
capable of handling multi-block messages and processing multiple messages. The remainder of this paper is
organized as follows. Section 2 briefly introduces the SHA-3 specifications; section 3 presents the
proposed designs of the SHA-3 hash function; the FPGA synthesis results and comparisons with previously
published works are provided in section 4; and the conclusions are discussed in the last section.


2. SHA-3 SPECIFICATIONS 

The SHA-3 algorithm, initially known as Keccak, is a cryptographic hash function designed by
Bertoni et al. [7]. Unlike SHA-1 [5] and SHA-2 [6], SHA-3 does not depend on the Merkle-Damgård
construction; it relies on the sponge construction instead. The purpose of using this structure is its security
against generic attacks. Moreover, this construction is characterized by its simplicity and adaptability.

The sponge construction is based on two phases. The first one is the absorbing phase, where the state,
a 5x5 matrix of 64-bit words, is initialized to the null matrix and then XORed with the first r-bit message
block, after which the state is updated by the function f. This process continues until all the blocks
have been absorbed. The second phase is the squeezing phase, where the output hash value is truncated from
the first r bits of the state, and further applications of f are made if the required number of output bits
has not yet been obtained.

The same function f is used for the two phases. Figure 1 shows how the sponge
construction absorbs the message blocks M and how the outputs Z are generated. The sponge construction
permits arbitrary-length outputs.


Figure 1. Absorbing and squeezing phases of the sponge construction


High-performance field-programmable gate array implementation of ... (Fatimazahraa Assad) 


1326 O ISSN: 2088-8708 


The SHA-3 family consists of four cryptographic hash functions: SHA3-224, SHA3-256, SHA3-384,
and SHA3-512 [3]. These functions share the same structure, namely, the sponge construction. Depending on
the desired output length, two parameters are used for the sponge construction: the bitrate r, which is the
length of one message block, and the capacity C, where r+C=b and b=1600 bits is the width of the state.
These parameters can be chosen by the user. The security of the SHA-3 algorithm can be enhanced by
increasing the capacity C and reducing the bitrate r, as shown in Table 1.
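
The relation r+C=b can be checked numerically. The following is a minimal sketch, assuming b=1600 and the SHA-3 convention that the capacity is twice the output length (C=2d); the function name `sponge_parameters` is illustrative and does not come from the proposed design.

```python
B = 1600  # width b of the Keccak-f[1600] state, in bits


def sponge_parameters(digest_bits):
    """Return (r, C) for a SHA-3 variant, assuming C = 2 * digest length."""
    capacity = 2 * digest_bits
    rate = B - capacity
    return rate, capacity


for d in (224, 256, 384, 512):
    r, c = sponge_parameters(d)
    print(f"SHA3-{d}: r = {r} bits, C = {c} bits")
```

Under these assumptions the four variants reproduce the bitrate and capacity pairs listed in Table 1.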

In our design we have taken into consideration all four cases, depending on the desired output length.
It should be noted that the security of the hashed message can be varied by changing the hash result length.
Keccak comprises 24 rounds; each round is subdivided into five steps: θ (Theta), ρ (Rho), π (Pi), χ (Chi),
and ι (Iota), as shown in Figure 2.


Table 1. The capacity-bitrate values
Algorithm   r (bits)   C (bits)   Hash output (bits)
SHA3-224    1152       448        224
SHA3-256    1088       512        256
SHA3-384    832        768        384
SHA3-512    576        1024       512


Figure 2. The inner construction of Keccak-f

The mathematical formulas of the five sub-functions (θ, ρ, π, χ, and ι) are presented below, where w=64 is
the word length:

Theta (θ) step:
$$C(x, z) = \mathrm{Theta\_in}(x, 0, z) \oplus \mathrm{Theta\_in}(x, 1, z) \oplus \mathrm{Theta\_in}(x, 2, z) \oplus \mathrm{Theta\_in}(x, 3, z) \oplus \mathrm{Theta\_in}(x, 4, z)$$
$$D(x, z) = C[(x - 1) \bmod 5,\ z] \oplus C[(x + 1) \bmod 5,\ (z - 1) \bmod w]$$
$$\mathrm{Theta\_out}(x, y, z) = \mathrm{Theta\_in}(x, y, z) \oplus D(x, z)$$

Rho (ρ) step:
$$\mathrm{Rho\_out}(0, 0, z) = \mathrm{Rho\_in}(0, 0, z)$$
Let (x, y) = (1, 0). For t from 0 to 23:
$$\mathrm{Rho\_out}(x, y, z) = \mathrm{Rho\_in}[x,\ y,\ (z - (t + 1)(t + 2)/2) \bmod w]$$
$$(x, y) = (y,\ (2x + 3y) \bmod 5)$$

Pi (π) step:
$$\mathrm{Pi\_out}(x, y, z) = \mathrm{Pi\_in}[(x + 3y) \bmod 5,\ x,\ z]$$

Chi (χ) step:
$$\mathrm{Chi\_out}(x, y, z) = \mathrm{Chi\_in}(x, y, z) \oplus [\mathrm{NOT}(\mathrm{Chi\_in}((x + 1) \bmod 5, y, z))\ \mathrm{AND}\ \mathrm{Chi\_in}((x + 2) \bmod 5, y, z)]$$

Iota (ι) step:
$$\mathrm{Iota\_out}(0, 0, z) = \mathrm{Iota\_in}(0, 0, z) \oplus \mathrm{RC}[i](z)$$


RC is a 64-bit value predefined for each round. The Keccak-f permutation comprises 24 rounds, which are
identical except for the addition of a round-dependent constant. The constants RC[i] are the round
constants, where i is the round number.
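
The five step mappings and the sponge construction of this section can be sketched in software as a functional reference. The code below is an illustrative model, not the VHDL data path proposed in this paper. It assumes that each lane is held as a 64-bit integer (bit z of lane (x, y) is the bit with weight 2^z), generates the round constants with the LFSR of the Keccak specification, and uses the SHA-3 padding of the standard (suffix byte 0x06, final byte 0x80). All function names are hypothetical.

```python
import hashlib  # only used to cross-check the sketch

MASK = (1 << 64) - 1  # lanes are w = 64 bits wide


def rol(lane, n):
    """Rotate a 64-bit lane left by n bits (bit z moves to bit (z + n) mod 64)."""
    n %= 64
    return ((lane << n) | (lane >> (64 - n))) & MASK


def theta(A):
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rol(C[(x + 1) % 5], 1) for x in range(5)]
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]


def rho(A):
    out = [[0] * 5 for _ in range(5)]
    out[0][0] = A[0][0]
    x, y = 1, 0
    for t in range(24):
        out[x][y] = rol(A[x][y], (t + 1) * (t + 2) // 2)
        x, y = y, (2 * x + 3 * y) % 5
    return out


def pi(A):
    return [[A[(x + 3 * y) % 5][x] for y in range(5)] for x in range(5)]


def chi(A):
    return [[A[x][y] ^ (~A[(x + 1) % 5][y] & A[(x + 2) % 5][y] & MASK)
             for y in range(5)] for x in range(5)]


def round_constants():
    """Generate the 24 constants RC[i] with the LFSR of the Keccak specification."""
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                rc ^= 1 << ((1 << j) - 1)
        constants.append(rc)
    return constants


RC = round_constants()


def keccak_f(A):
    """Apply the 24 rounds theta, rho, pi, chi, iota to a 5x5 array of lanes."""
    for rc in RC:
        A = chi(pi(rho(theta(A))))
        A[0][0] ^= rc  # iota: only lane (0, 0) receives the round constant
    return A


def sha3(message, digest_bits):
    """Sponge: pad, absorb r-bit blocks, then truncate the first output bits."""
    rate = (1600 - 2 * digest_bits) // 8            # bitrate r in bytes
    padded = bytearray(message) + bytes(rate - len(message) % rate)
    padded[len(message)] ^= 0x06                    # SHA-3 suffix and first pad bit
    padded[-1] ^= 0x80                              # last pad bit
    A = [[0] * 5 for _ in range(5)]                 # initial zero state
    for start in range(0, len(padded), rate):
        block = padded[start:start + rate]
        for i in range(rate // 8):                  # lane i sits at x = i % 5, y = i // 5
            A[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        A = keccak_f(A)
    out = b"".join(A[i % 5][i // 5].to_bytes(8, "little") for i in range(25))
    return out[:digest_bits // 8]


print(sha3(b"abc", 512) == hashlib.sha3_512(b"abc").digest())
```

Because every output length of Table 1 is shorter than the corresponding bitrate, a single squeezing step is enough in this sketch. Comparing against `hashlib` gives a quick functional check of the step mappings.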


3. RESEARCH METHOD 

The basic hardware architecture of the SHA-3 hash function has a 64-bit data input and four inputs that
control the overall operation. It also has a 512-bit (384-bit, 256-bit, or 224-bit) data output and a 1-bit
output indicating that the hash is ready. The message block is processed from the beginning until the
generation of the hash value, including the padding unit. Moreover, the design does not rely on dedicated
FPGA resources such as digital signal processors (DSPs). As shown in Figure 3, we applied four types of
padding, taking into account all the different output lengths (224, 256, 384, and 512), to build a complete
design of the Keccak hash algorithm. The SE signal selects the desired output length.


Figure 3. The process of padding


After the process of padding, the padded message must be cut into blocks of the same size. The following
expression is used to obtain the 1600-bit data, which is subsequently converted to a state:

$$\mathrm{data} = (\mathrm{Block} \oplus r) \,\|\, C$$

Finally, the 1600-bit data is converted into a three-dimensional array (the state) by applying (1):

$$\mathrm{State}(x, y, z) = f(z + 64(5y + x)) \tag{1}$$

where f(k) denotes bit k of the 1600-bit data. As required by the algorithm, an initial zero state is used
during the first iteration (bitrate $r_0$ and capacity $C_0$) [7]. The state of SHA-3 consists of a 5x5 matrix
of 64-bit words, as shown in Figure 4, with 0 ≤ x ≤ 4, 0 ≤ y ≤ 4, and 0 ≤ z ≤ 63.


Figure 4. The state of SHA-3 
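
The mapping in (1) and its inverse, which is applied before the final truncation, can be illustrated as follows. This is a small sketch that assumes the 1600-bit data is available as a flat list of bits; the helper names `bits_to_state` and `state_to_bits` are hypothetical.

```python
W = 64  # lane width in bits


def bits_to_state(bits):
    """Build State[x][y][z] from 1600 bits: State(x, y, z) = bits[z + 64*(5*y + x)]."""
    if len(bits) != 25 * W:
        raise ValueError("expected 1600 bits")
    return [[[bits[z + W * (5 * y + x)] for z in range(W)]
             for y in range(5)] for x in range(5)]


def state_to_bits(state):
    """Inverse mapping: flatten the state back into the 1600-bit string."""
    return [state[x][y][z] for y in range(5) for x in range(5) for z in range(W)]


data = [(7 * k) % 3 % 2 for k in range(1600)]   # arbitrary test pattern
state = bits_to_state(data)
print(state_to_bits(state) == data)
```
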
At the beginning of the round, a two-to-one multiplexer is used. This multiplexer drives the data from
the preprocessing block to the main component of SHA-3, and its principal function is to allow the
twenty-four rounds to be processed. When Ctrl_Mux is low, the multiplexer selects the input data; when it is
high, the multiplexer selects the feedback data. In other words, the input data are selected for the first
round and the feedback data for the other twenty-three rounds.

The output of the Mux is forwarded to the compression function, which is the main unit of SHA-3 and
performs the computation part of the algorithm. This module carries out 24 processing rounds of five
permutations (θ, ρ, π, χ, and ι), whose mathematical formulas are given above. At the end of the round, a
register is used to synchronize the data path by storing the result of each round. Then a 1x2 demultiplexer
is used to process the twenty-four rounds of the compression function: i) when Ctrl_DMux is low, the
demultiplexer routes the data to the feedback path; otherwise, ii) it routes the data to the input of the
truncating block.

The hash value is generated after 24 iterations, so 24 constant values are needed to compute the hash
result. In our proposed architecture, the appropriate round constants (RC) used in the last sub-function,
Iota, are selected using a counter (0 to 23) generated by the control unit. This counter is the address bus
of the ROM in which the RC constants defined by the algorithm [7] are stored. Finally, after twenty-four
rounds, the output is simply truncated to produce the final hash value with the desired length. It should be
noted that, before the final hash value is generated, the inverse of the mapping procedure must be applied
to obtain the correct output. The HR signal indicates that the hash result is ready.

As mentioned previously, the design is controlled by the control unit, which is implemented as a finite
state machine (FSM). This FSM controls and synchronizes the different components of the SHA-3 design. The
control unit has three inputs (ST, Reset, and Clk), three 1-bit outputs (Ctrl_Mux, Ctrl_DMux, and HR), and a
5-bit output (CMPT). The FSM includes three states. State S1, the initial state, is for the preprocessing of
the input data. State S2 is for the main computation; it includes a counter that counts up to 24, which
corresponds to the 24 iterations of the transformation round. State S3 is used for truncating the message
block and indicates that the hash result is ready by setting the HR signal to '1'. The design needs one
clock cycle to perform one round. Consequently, 24 clock cycles are needed to compute the 24 iterations, and
the hash value is ready after twenty-four clock cycles.
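
The behaviour of the control unit can be sketched as a cycle-by-cycle trace for one message block. The state names and signal names follow the description above, but the exact cycle in which each signal toggles is an assumption of this sketch, as is the function name `control_trace`.

```python
ROUNDS = 24


def control_trace():
    """Return the assumed control signals per clock cycle for one message block."""
    # S1: preprocessing of the input data (assumed to take one cycle here).
    trace = [{"state": "S1", "CMPT": 0, "Ctrl_Mux": 0, "Ctrl_DMux": 0, "HR": 0}]
    # S2: one Keccak round per clock cycle; CMPT addresses the RC ROM.
    for cmpt in range(ROUNDS):
        trace.append({
            "state": "S2",
            "CMPT": cmpt,
            "Ctrl_Mux": 0 if cmpt == 0 else 1,            # input data only in round 0
            "Ctrl_DMux": 1 if cmpt == ROUNDS - 1 else 0,  # last round goes to truncation
            "HR": 0,
        })
    # S3: truncation; HR = 1 flags that the hash result is ready.
    trace.append({"state": "S3", "CMPT": 0, "Ctrl_Mux": 0, "Ctrl_DMux": 1, "HR": 1})
    return trace


for cycle, signals in enumerate(control_trace()):
    print(cycle, signals)
```

The counter value never exceeds 23, which is why a 5-bit CMPT output is sufficient for the non-pipelined design.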

Based on (2), the Throughput of SHA-3 can be improved in two ways: either by increasing the
numerator, represented by the Frequency, or by decreasing the denominator (the number of clock cycles). In
this work, we opted for the second possibility by using the pipelining technique. Pipelining reduces the
number of clock cycles between two consecutive blocks, which increases the Throughput according to (2). Four
different designs are proposed: two, three, four, and six-stage pipelines. We begin with the case of the
two-stage pipeline. The design in Figure 5 yields the first block after 24 cycles and delivers a new block
every 12 cycles. This allows multiple messages to be hashed simultaneously. Each core contains the
implementation of the five sub-functions, and the output of this component is stored in a register. The
output of the register is fed as input to the multiplexer of the following stage. At the last cycle, the
output is fed to the truncating unit to produce the final hash value. The 5-bit counter is replaced by a
4-bit counter that steers the twelve iterations at each core. It should be mentioned that the size of the
ROMs used to store the constant values is adjusted according to the number of cores used in each case. In
the case of the two-stage pipeline, two cores are needed instead of a single core. Accordingly, the size of
the constant unit is adjusted (Figure 5).

To describe the pipelining mechanism, assume that we have three messages of one block or more,
where $B_{mn}$ denotes block n of message m. During the first 12 clock cycles, block $B_{11}$ is processed by
the first stage while the second stage is empty. Then, during the next 12 clock cycles, $B_{11}$ is processed
by the second stage and stage 1 receives block $B_{21}$. In the same way, the blocks $B_{12}$, $B_{22}$,
$B_{13}$, $B_{23}$, ... are processed by the two stages. If the first two messages have further blocks, these
must be processed first; otherwise, it is the turn of the first block of the third message, $B_{31}$, to be
hashed, and so on. At the last cycle, the output is fed to the truncating unit to produce the final hash value.
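
The interleaving described above can be reproduced with a short scheduling model. This is a sketch under stated assumptions: each message has at least one block, every stage holds a block for 24/stages clock cycles, and a block of a message may only enter stage 1 once the previous block of the same message has left the last stage. The function name `pipeline_schedule` and the labels such as "B11" are illustrative.

```python
from collections import deque

ROUNDS = 24


def pipeline_schedule(blocks_per_message, stages=2):
    """Return one row per time slot listing the block held by each stage."""
    pending = deque((m + 1, total) for m, total in enumerate(blocks_per_message))
    active = [None] * stages   # active[i] = (message, next block, total blocks)
    issued = []                # block entering stage 1 in each time slot
    while pending or any(active):
        i = len(issued) % stages               # message slot served in this time slot
        if active[i] is None and pending:
            m, total = pending.popleft()
            active[i] = (m, 1, total)
        if active[i] is None:
            issued.append(None)                # bubble: nothing left to issue
        else:
            m, n, total = active[i]
            issued.append(f"B{m}{n}")
            active[i] = (m, n + 1, total) if n < total else None
    slots = len(issued) + stages - 1           # extra slots to drain the pipeline
    return [[issued[t - s] if 0 <= t - s < len(issued) else None
             for s in range(stages)] for t in range(slots)]


cycles_per_slot = ROUNDS // 2
for t, row in enumerate(pipeline_schedule([3, 2, 1], stages=2)):
    first = t * cycles_per_slot
    print(f"cycles {first}-{first + cycles_per_slot - 1}: stage 1 = {row[0]}, stage 2 = {row[1]}")
```

With three messages of 3, 2, and 1 blocks, the model issues B11, B21, B12, B22, B13, and then B31, which matches the order given in the text.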

The three-stage pipeline design outputs the first block after exactly 24 clock cycles and can
process a new block every 8 clock cycles, instead of 12 in the case of the two-stage pipeline. The 4-bit
counter is replaced by a 3-bit counter that steers the eight iterations at each of the three cores and
selects the appropriate round constants. The four-stage pipeline design outputs the first block after
exactly 24 clock cycles and can process a new block (receive the block and compute its hash value) every 6
clock cycles. In the case of the four-stage pipeline, it is also necessary to replace the 4-bit counter with
a 3-bit one, so as to count up to 6. The six-stage pipeline generates the first block after 24 cycles and
delivers a new block every 4 cycles, as depicted in Figure 6. To ensure the correct functioning of the
system, the 3-bit counter is replaced by a 2-bit counter. Six counters are needed in this case, as shown in
Figure 6. Also, the size of the ROMs used to store the constant values is adjusted.
Figure 6. Six-stage pipelining
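
The counter widths and ROM sizes of the four designs follow from dividing the 24 rounds among the cores. The sketch below assumes that rounds are assigned to cores in contiguous groups (core k handles rounds k*24/stages onwards), which the text does not state explicitly; the function name `pipeline_parameters` is hypothetical.

```python
ROUNDS = 24


def pipeline_parameters(stages):
    """Return (rounds per core, counter width in bits, RC indices stored per core)."""
    if ROUNDS % stages:
        raise ValueError("the number of stages must divide 24")
    rounds_per_core = ROUNDS // stages
    # Smallest width able to count from 0 to rounds_per_core - 1.
    counter_bits = (rounds_per_core - 1).bit_length()
    # Assumption: contiguous split of the 24 round constants among the ROMs.
    rom_contents = [list(range(k * rounds_per_core, (k + 1) * rounds_per_core))
                    for k in range(stages)]
    return rounds_per_core, counter_bits, rom_contents


for stages in (1, 2, 3, 4, 6):
    rounds, bits, roms = pipeline_parameters(stages)
    print(f"{stages} stage(s): {rounds} rounds per core, {bits}-bit counter, "
          f"{len(roms)} ROM(s) of {rounds} constants")
```

The resulting widths (5, 4, 3, 3, and 2 bits) agree with the counters described for the one, two, three, four, and six-stage designs.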


4. RESULTS AND DISCUSSION 


This section presents the results obtained in each case: two, three, four, and six pipeline stages.
Comparisons between our results and those of previous SHA-3 implementations are also
presented. The SHA-3 hardware architectures (for all cases) were coded in VHDL, synthesized, and
implemented on Virtex 5 and Virtex 6 FPGA devices using the Xilinx ISE Design Suite v.12.1. Their correct
functionality was initially verified through post-place-and-route simulation with the ModelSim simulator.

The measurements were focused on the used FPGA Area resources, Frequency, design output 
Throughput, and Efficiency. Two FPGA boards were used: a Xilinx Virtex 5 XC5VLX110-1ff1760, and a 
Xilinx Virtex 6 XC6VLX760-1ff1760. Generally, there are three primary definitions of speed depending on 
the context of the problem: Throughput, Latency, and Timing [32]. 
The Throughput is obtained by using (2):

$$\mathrm{Throughput} = \frac{\text{block size}\,(r) \times \text{frequency}}{\text{Latency}} \tag{2}$$

where Latency represents the number of clock cycles needed to compute the hash value.
The Efficiency is calculated by using (3):

$$\mathrm{Efficiency} = \frac{\mathrm{Throughput}}{\mathrm{Area}} \tag{3}$$
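
Equations (2) and (3) can be evaluated directly. The sketch below uses the six-stage Virtex 5 figures reported in Tables 2 and 3 for the 512-bit output (r=576 bits, 366.166 MHz, a Latency of 4 cycles, 6240 slices); the function names and the unit conversions (MHz and bits giving Mbps, then Gbps) are choices of this example.

```python
def throughput_gbps(block_size_bits, frequency_mhz, latency_cycles):
    """Equation (2): bits per block times clock frequency over clock cycles per block."""
    mbps = block_size_bits * frequency_mhz / latency_cycles
    return mbps / 1000.0


def efficiency_mbps_per_slice(throughput_in_gbps, area_slices):
    """Equation (3): Throughput divided by the occupied Area."""
    return throughput_in_gbps * 1000.0 / area_slices


tp = throughput_gbps(576, 366.166, 4)
eff = efficiency_mbps_per_slice(tp, 6240)
print(f"Throughput = {tp:.2f} Gbps, Efficiency = {eff:.2f} Mbps/slice")
```

The same two functions can be applied to any row of Table 2 by substituting the corresponding bitrate, frequency, and number of clock cycles.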


The complete implementation results are given in Tables 2 and 3, and a comparison between the
achieved results and previous works is given in Table 4. For all the considered FPGA families, the
Throughput increases approximately linearly with the number of pipeline stages, as shown in Table 2. This is
explained by the fact that the number of clock cycles is the dominant factor of the Throughput in (2). The
design reaches a maximum Throughput of 102.98 Gbps on Virtex 5 and 115.124 Gbps on Virtex 6 in the case of
the six stages.


Table 2. Implementation results
FPGA device  Stages  Latency  Maximum frequency  Throughput (Gbps)
                              (MHz)              TP.224    TP.256    TP.384   TP.512
                                                 (r=1125)  (r=1088)  (r=832)  (r=576)
Virtex 5     1       24       338.409            15.86     15.34     11.73    8.12
             2       12       338.40             31.72     30.68     23.46    16.24
             3       8        354.86             49.90     48.26     36.90    25.55
             4       6        354.86             66.54     64.35     49.21    34.07
             6       4        366.166            102.98    99.60     76.16    52.73
Virtex 6     1       24       376.081            17.63     17.05     13.04    9.02
             2       12       393.10             36.85     35.64     27.25    18.87
             3       8        408.163            57.40     55.51     42.45    29.38
             4       6        426.257            79.92     77.29     59.11    40.92
             6       4        409.33             115.124   111.34    85.140   58.94

Table 3. Efficiency results
FPGA device  Stages  Latency  Area (slices)  Efficiency (Mbps/slice)
                                             224     256     384     512
Virtex 5     1       24       935            16.96   16.40   12.54   8.68
             2       12       2366           13.40   12.96   9.91    6.86
             3       8        2759           18.08   17.49   13.37   9.26
             4       6        4061           16.38   15.84   12.12   8.38
             6       4        6240           16.50   15.96   12.20   8.45
Virtex 6     1       24       1019           17.30   16.73   12.80   8.85
             2       12       2499           14.74   14.26   10.90   7.55
             3       8        3607           15.91   15.38   11.77   8.14
             4       6        4396           18.18   17.58   13.45   9.30
             6       4        6935           16.60   16.05   12.27   8.50


Table 4. Comparison of results on Virtex 5 and Virtex 6 for 512 bits output length
Reference     Area (slices)   Frequency (MHz)   Throughput (Gbps)  Efficiency (Mbps/slice)
              V5      V6      V5       V6       V5      V6         V5      V6
[14]          4793    -       317.11   -        12.68   -          2.71    -
[11]          1702    1649    389      397      18.7    19.1       10.99   11.58
[12]          240     -       301.02   -        7.224   -          30.1    -
[33]          4361    5528    328.2    401.2    7.87    9.62       1.80    1.74
[34]          1388    1167    287.39   394.01   11.50   15.76      8.28    13.50
[13]          1647    1181    137.01   251.7    2.39    4.39       1.45    3712
[35]          7224    10120   78.13    95.70    11.25   13.78      1.55    1.36
[17]          1192    -       223      -        5.35    -          4.49    -
[36]          2573    -       285      -        5.70    -          2.21    -
Our approach  6240    6935    366.166  409.33   52.72   58.94      8.44    8.49
(6 stages)

On the other hand, due to the use of six cores, there is a significant increase in Area: our six-stage
design requires 6240 slices on Virtex 5 and 6935 slices on Virtex 6. As can be seen in Table 2, the
throughput increases noticeably, which is explained by the fact that the number of clock cycles is the
dominant factor of the throughput equation (2). On the downside, the area occupied by the
algorithm increases with the number of cores used in each case. The maximum
Frequency and the Throughput improve when moving to more modern FPGA families: the values obtained on
Virtex 6 are higher than those obtained on Virtex 5.

Table 4 compares the proposed design with previous works in terms of Area,
Frequency, Throughput (TP), and Throughput per area (TPS), or Efficiency, for the 512-bit output length on
Virtex 5 and Virtex 6 FPGAs. Compared with the previous works, our design achieves a
significantly higher Throughput: 52.72 Gbps on Virtex 5 and
58.94 Gbps on Virtex 6, as shown in Table 4. On the downside, the speed enhancement offered by pipelining
comes at the expense of a large hardware area cost of 6240 slices on Virtex 5.


5. CONCLUSION 

In this paper, a pipelined architecture for the SHA-3 algorithm was proposed. We have implemented
two, three, four, and six pipeline stages and studied the change in the performance values in each case. This
work shows the impact of the pipelining technique on the performance of SHA-3. The Throughput increases
noticeably with the number of stages, and the six-stage pipeline achieves the highest
Throughput. On the other hand, this technique has an adverse effect on the area, which increases with the
number of stages.